Skip to content

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486

Open
anderdc wants to merge 1 commit into
testfrom
fix/pat-store-fail-closed
Open

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486
anderdc wants to merge 1 commit into
testfrom
fix/pat-store-fail-closed

Conversation

@anderdc

@anderdc anderdc commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Fixes #1481

Targets test. Verified against origin/test @ 1c813f5; full suite green (uv run pytest tests/ → 940 passed), ruff lint + format clean, pyright clean on changed files.

What this fixes

The root cause in #1481 (Proof 2): a single failed read of miner_pats.json silently and permanently erases every stored PAT.

_read_file() caught (json.JSONDecodeError, OSError) and returned [], so it could not tell "the store is empty" from "the read just failed" (a corrupt file, a transient I/O blip, a partial write). The next save_pat() read that [], upserted the one incoming PAT, and atomically overwrote the file — wiping every other miner's stored PAT. Those miners are then scored 0 by that validator every round (No stored PAT for miner {uid}, inspections.py:134) until each manually re-posts, with no error and no signal.

The change (validator-side, ~a handful of lines)

  • _read_file() fails loud — returns [] only when the file genuinely does not exist; raises on a corrupt/unreadable file.
  • save_pat() / get_pat_by_uid() fail closed — they propagate that error, so the write path can no longer mistake a failed read for an empty store and overwrite it. A failed read raises before any write; the existing store is left intact.
  • handle_pat_broadcast rejects gracefully — catches the store-read error and returns a clear "PAT store temporarily unavailable; please retry" rejection, so the miner retries instead of the axon surfacing a raw error.
  • load_all_pats() stays tolerant — it is the read-only scoring snapshot; an unhandled error there would stop the validator (the run loop's except sits outside while True) or zero everyone. It now logs loudly and returns [], so a transient read failure recovers on the next round and never wipes anything.

Why this is the appropriate fix

The store loss is the actual bug, and stopping the wipe protects every miner on every validator with no miner action — it doesn't ask honest miners to run a re-broadcast daemon to paper over validator-side data loss. It applies the same principle already adopted in #931 / #932 / #1107: a transient condition must not produce a permanent wrong outcome. Failing closed and loud also surfaces a genuinely corrupt store to the operator instead of silently degrading it down to a single entry.

Tests

tests/validator/test_pat_storage.py (#1481 Proof 2 → regression coverage):

  • a transient read error makes save_pat raise and leaves the on-disk store fully intact (all prior entries survive; the incoming UID is never written);
  • a corrupt store is left as-is on save, not overwritten down to one entry.

tests/validator/test_pat_handler.py:

  • an unreadable store makes handle_pat_broadcast reject (accepted=False, "temporarily unavailable") rather than wiping the store or raising.
uv run pytest tests/   -> 940 passed
ruff check / format    -> clean
pyright (changed)      -> 0 errors

Follow-up (not in this PR)

This closes the code path that lets a well-configured validator wipe itself. It does not cover PAT loss from a validator restarting without a persistent ./data volume, or stuck in a crash loop — those are operator config, not code. The follow-up is to reach out to the validators currently missing coverage (the vali-116 / rt21-64 cases in #1481) to confirm they mount ./data:/app/data and aren't crash-looping.

A single failed read of miner_pats.json (corrupt file or transient I/O)
made _read_file return [], and the next save_pat upserted into that []
and atomically overwrote the store — erasing every other miner's stored
PAT, who were then silently scored 0 until each re-posted.

_read_file now raises on a corrupt/unreadable file (returns [] only when
the file genuinely does not exist). save_pat/get_pat_by_uid propagate, so
the write path fails closed and the broadcast handler rejects (miner
retries) instead of overwriting. load_all_pats stays tolerant so a
transient read failure cannot crash the scoring round.

Fixes #1481
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] gitt miner post coverage silently and permanently erodes after validator restarts — valid miners are de-scored with no signal and no recovery

1 participant